The TSL sequence service is a raw sequence data management service, available at <sequences.tsl.ac.uk>. The site can be accessed only when you are on the NBI network, either directly or through VPN. The purpose of the TSL sequence service is to store raw data securely and in a well-organised way through Projects, Samples and Runs, capturing the relevant metadata in a structure that reflects the ENA data model and so makes ENA submission easier.
The top-level container is the 'project', which holds the metadata about the study and is made up of many 'samples' (equivalent to different experimental conditions). Each 'sample' in turn holds many 'runs' (roughly equivalent to runs of a sequencing machine).
Raw datasets are provided by your sequencing provider and need to be downloaded. The raw data files can be very large. You can temporarily store files at /tsl/data/dropbox/. You may also download a copy to an external drive.
The best way to upload raw reads data is to move one or more folders, each of which contains all of, and only, the raw reads files (and associated MD5 sum files) for a given run submission.
Otherwise, if you have downloaded the data to the HPC dropbox area, the best way to upload it is from a browser in the remote desktop service (RDS). Because the dropbox folder is mounted on RDS, uploading very large files from there is much faster. You can also use a browser on your own laptop, but uploading large files (stored in the HPC dropbox or on an external drive) can be slow, especially over the eduroam Wi-Fi network rather than a network cable at your desk.
Here, I will show you how to upload your data to the TSL sequence service using RDS.
Follow the steps below to connect to RDS:
Open Chrome, Firefox or Microsoft Edge (old or alternative browsers may not work)
Go to the remote desktop service address https://winrds.nbi.ac.uk/RDWeb/webclient/
Enter your user email (your_username@nbi.ac.uk) and password
You will see the remote desktop window as shown below.
Open Firefox or Chrome. Open a new tab in the browser and type the TSL sequence service address http://sequences.tsl.ac.uk
Enter your username and password. After a successful login, you will see a home page similar to the one below:
A list of projects is shown on the left-hand side. You may not see any projects listed if you are logging in for the first time.
A form for filling in the project details will be displayed.
After a project is created, the project page is displayed as below:
You are ready to add samples for the project you have just created.
The next page shows the details of the sample you have just created.
At this stage, you can create more samples by clicking the project title.
You will then go back to the project details page, where you will now find the sample you have created is listed.
To add a new sample, follow the same steps as before.
Let’s now add a dataset to the sample.
The fields sequencing technology, library source, library selection, library type and library strategy are dropdown lists; select the option that matches your dataset. See what I have selected above.
After you have filled in the form above, select the files in the Raw reads section. Depending on where your raw read data files are located, you have two choices: HPC upload or Local filesystem upload. The links to these choices are just below the Raw reads section title. By default, HPC Uploads is selected.
This is a new feature added to the system. If your files are in the HPC directory /tsl/data/tempWebUploadToSequences, you can use this choice. It is the recommended choice because the files are simply copied rather than uploaded, which is very quick even for very large files. The limitation is that all files for a given Run submission must be selected from a single directory. Please read the points under Please read carefully, then tick the two check boxes to declare that you have read the instructions and that you are storing the files in the folder /tsl/data/tempWebUploadToSequences only temporarily. Once you have ticked the two check boxes, you will see an empty input field that completes the path /tsl/data/tempWebUploadToSequences; for illustration purposes I have entered the word 'cheese' into it.
Now prepare your files for uploading as follows. First create a folder for your sample:
mkdir /tsl/data/tempWebUploadToSequences/sampleA
Then generate the MD5 checksum of each FASTQ file:
md5sum /path/to/sampleA/sampleA_R1.fastq.gz
It will display the MD5 checksum like below:
274fd10aa73065e85d5672edcd07e80c  /path/to/sampleA/sampleA_R1.fastq.gz
Copy the FASTQ files to the folder:
cp /path/to/sampleA/sampleA_R1.fastq.gz /tsl/data/tempWebUploadToSequences/sampleA
If the data are paired-end, copy both the R1 and R2 files to the folder. If you have multiple sets of R1 and R2 FASTQ files for the sample, copy all of them to the folder.
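The preparation steps above can be wrapped in a small shell function when a sample has many files. This is only a sketch: `stage_run` is a hypothetical helper name, not part of the sequence service, and it assumes that your files end in `.fastq.gz` and that `md5sum` from GNU coreutils is available, as it is in the commands above.

```bash
# stage_run SRC_DIR DEST_DIR
# Copies every *.fastq.gz file from SRC_DIR into DEST_DIR (created if needed)
# and prints the MD5 checksum of each copied file.
# Hypothetical helper for illustration; not part of the sequence service.
stage_run() {
    local src="$1" dest="$2" f
    mkdir -p "$dest" || return 1
    for f in "$src"/*.fastq.gz; do
        # If the glob matched nothing, $f is the literal pattern and does not exist
        [ -e "$f" ] || { echo "no FASTQ files found in $src" >&2; return 1; }
        cp "$f" "$dest/" || return 1
    done
    # Run md5sum inside DEST_DIR so the output shows bare file names
    (cd "$dest" && md5sum *.fastq.gz)
}

# Example call with the paths used in this guide:
# stage_run /path/to/sampleA /tsl/data/tempWebUploadToSequences/sampleA
```

Computing the checksums on the copies rather than on the source files also confirms that the copy itself completed, provided the values match those from your sequencing provider.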
Now, type the folder name to search for your files. For example, I named my folder sampleA, so I search with that name and all files in the folder for my sample are displayed. See below:
Fill in the MD5 checksum for every file, using the values you generated with the md5sum command above. In the sibling column, choose the R2 FASTQ file for each R1 FASTQ file and vice versa. If you selected FASTQ-single in the form above, the sibling column is not shown.
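Before choosing siblings in the form, it can help to confirm on the command line that every R1 file in the folder actually has an R2 partner. The sketch below assumes the `_R1.fastq.gz` / `_R2.fastq.gz` naming used in this guide; `list_unpaired` is a hypothetical helper name, and files from your provider may follow a different naming pattern.

```bash
# list_unpaired DIR
# Prints each *_R1.fastq.gz file in DIR that has no matching *_R2.fastq.gz
# and returns non-zero if at least one sibling is missing.
# Hypothetical helper; assumes the _R1/_R2 naming convention.
list_unpaired() {
    local dir="$1" r1 r2 missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz" # swap the R1 suffix for R2
        if [ ! -e "$r2" ]; then
            echo "no R2 sibling for: $(basename "$r1")"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the folder used in this guide:
# list_unpaired /tsl/data/tempWebUploadToSequences/sampleA
```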
If everything is fine, i.e. all files are displayed and you have filled in the correct MD5 checksum for each file and the siblings (for FASTQ-paired only), click “Validate and lock choices”. You can still revise the form at this point. If all looks fine, click “Lock choices”. After this, you cannot make changes to the form.
Below, you have the option to upload any additional files for the sample.
Now scroll down, read the statements carefully, and tick the two checkboxes.
Click the “Create run” button to upload the FASTQ files to sequences.tsl.ac.uk.
This is the same feature as before. You will need to browse the filesystem to select your raw FASTQ data files. Because I selected FASTQ-paired as the Library type in the form above, the form expects me to upload two files: the forward (R1) and reverse (R2) sequences. If FASTQ-single was selected, it expects a single FASTQ file.
I have navigated to my folder \\tsl-hpc-data\HPC-Data\dropbox\ram (see above), where my test FASTQ files are located. Select the file and click “open” to upload the forward (R1) FASTQ file. Do the same to upload the reverse (R2) FASTQ file.
You will also need to supply the MD5 checksum for each FASTQ file. Your sequencing provider will also give you the MD5 checksums for the files.
If you have them, copy and paste each checksum into the box labelled MD5 (required) for the respective file.
If you don’t have the MD5 checksum, you can generate it on the HPC command line using the command
md5sum /path/to/filename
There is also a link to generate the MD5 for the files. Be warned, however, that if your file is already corrupted, the generated MD5 will describe the corrupted file and not the original. It is therefore best to use the MD5 provided by your sequencing provider.
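If your provider supplied the checksums as a file, you can check your downloaded copies against it before uploading. The sketch below assumes the file is in the usual `md5sum` output format (checksum, two spaces, file name) and sits in the same folder as the FASTQ files; the name `MD5.txt` and the helper `verify_md5` are hypothetical, so substitute whatever your provider actually sent.

```bash
# verify_md5 DIR CHECKSUM_FILE
# Checks every file listed in CHECKSUM_FILE (a name relative to DIR, or an
# absolute path) against its recorded MD5. Returns non-zero on any mismatch.
# Hypothetical helper for illustration.
verify_md5() {
    local dir="$1" sums="$2"
    # Run from inside DIR so the relative file names in the checksum file resolve
    (cd "$dir" && md5sum -c "$sums")
}

# Example call (MD5.txt is an assumed file name):
# verify_md5 /tsl/data/dropbox/your_folder MD5.txt
```

`md5sum -c` prints `OK` or `FAILED` for each file, so a corrupted download is caught before its checksum is entered in the form.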
Here, if you have more datasets for the sample, click “+Add another”, which lets you browse for further files to upload. Add the files and MD5 checksums in the same way as before. If there are no more datasets to add, you don’t need to click “+Add another”.
If you have additional files relating to the dataset, you can browse for them and add them here. If not, do nothing.
Check the box to confirm and then click “Create run”. If the button is not clickable, something is still missing from the form. Add all the details and try again.
Congratulations, you have successfully created a project and a sample, and uploaded the raw data files for the sample to the TSL sequence service.
The raw datasets uploaded to the TSL sequence service can be accessed on the HPC for analysis. The file paths to the datasets have the following format:
/tsl/data/reads/{your_group}/{your_project}/{sample_name}/{run_name}/raw/{fastqfilename}
Notice that I mentioned your_group above. The directory “/tsl/data/reads” contains one directory per group, each named after the username of the group leader.
For example, I have the following details:
project_name - differential_gene_expression_in_tomato
sample_name - sp2206
run_name - sp2206_001
fastqfiles - dataset_1.fastq.gz, dataset_2.fastq.gz
My group is bioinformatics, so after I created the project and added the sample with the sample name and raw files given above, my files are at the paths:
/tsl/data/reads/bioinformatics/differential_gene_expression_in_tomato/sp2206/sp2206_001/raw/dataset_1.fastq.gz
/tsl/data/reads/bioinformatics/differential_gene_expression_in_tomato/sp2206/sp2206_001/raw/dataset_2.fastq.gz
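In analysis scripts it can be convenient to build this path from its parts instead of typing it out. The following is a minimal sketch based only on the path format given above; `reads_dir` and the `READS_ROOT` override are hypothetical names introduced for illustration.

```bash
# reads_dir GROUP PROJECT SAMPLE RUN
# Prints the directory that holds the raw files of one run, following the
# format /tsl/data/reads/{group}/{project}/{sample}/{run}/raw.
# READS_ROOT is a hypothetical override, useful for testing elsewhere.
reads_dir() {
    local root="${READS_ROOT:-/tsl/data/reads}"
    printf '%s/%s/%s/%s/%s/raw\n' "$root" "$1" "$2" "$3" "$4"
}

# Example: list the FASTQ files of run sp2206_001
# ls -lh "$(reads_dir bioinformatics differential_gene_expression_in_tomato sp2206 sp2206_001)"
```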
Happy Data Analysis !!!